| Planned to purchase Product A | Actually placed an order for Product A - Yes | Actually placed an order for Product A - No | Total |
|---|---|---|---|
| Yes | 400 | 100 | 500 |
| No | 200 | 1300 | 1500 |
| Total | 600 | 1400 | 2000 |
A. Refer to the above table and find the joint probability of the people who planned to purchase and actually placed an order.
B. Refer to the above table and find the conditional probability that people actually placed an order, given that they planned to purchase.
ANSWER:
A. From the table, people who planned to purchase and actually placed an order = 400; total people = 2000
P(planned to purchase and actually placed an order) = (planned to purchase and actually placed an order) / total
= 400/2000
= 1/5
= 0.2
B. P(actually placed an order | planned to purchase) = (planned to purchase and actually placed an order) / (total planned to purchase)
= 400/500 = 4/5 = 0.8
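The same two probabilities can be computed programmatically; a minimal sketch (the table below is re-entered by hand from the question):

```python
import pandas as pd

# Contingency table from the question: planned vs. actual purchases.
table = pd.DataFrame(
    {"Ordered - Yes": [400, 200], "Ordered - No": [100, 1300]},
    index=["Planned - Yes", "Planned - No"],
)
total = table.values.sum()  # 2000 people surveyed

# A. Joint probability: planned AND ordered, divided by the grand total.
p_joint = table.loc["Planned - Yes", "Ordered - Yes"] / total  # 400/2000

# B. Conditional probability: ordered GIVEN planned, divided by the row total.
p_conditional = (table.loc["Planned - Yes", "Ordered - Yes"]
                 / table.loc["Planned - Yes"].sum())  # 400/500

print(p_joint, p_conditional)  # 0.2 0.8
```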
An electrical manufacturing company conducts quality checks at specified periods on the products it manufactures. Historically, the failure rate for the manufactured item is 5%. Suppose a random sample of 10 manufactured items is selected. Answer the following questions.
A. Probability that none of the items are defective?
B. Probability that exactly one of the items is defective?
C. Probability that two or fewer of the items are defective?
D. Probability that three or more of the items are defective ?
ANSWER: n (sample size) = 10; P(failure/defective) = 5% = 0.05; P(success/non-defective) = 95% = 0.95
A. P(no defective items) = 10C0 * P(Defective)^0 * P(Non-Defective)^10
= 1 * (0.05)^0 * (0.95)^10
= 1 * 1 * 0.5987369392383787
= 0.5987369392383787
B. P(exactly one item is defective) = 10C1 * P(Defective)^1 * P(Non-Defective)^9
= 10 * 0.05 * (0.95)^9
= 0.31512470486230454
C. P(two or fewer items are defective) = P(0 defective) + P(1 defective) + P(2 defective)
= 10C0 * P(Defective)^0 * P(Non-Defective)^10 + 10C1 * P(Defective)^1 * P(Non-Defective)^9 + 10C2 * P(Defective)^2 * P(Non-Defective)^8
= 0.5987369392383787 + 0.31512470486230454 + 45 * (0.05)^2 * (0.95)^8
= 0.5987369392383787 + 0.31512470486230454 + 0.07463479852001952
= 0.9884964426207028
D. P(three or more of the items are defective ) = 1-P(two or fewer of the items are defective)
= 1- 0.9884964426207028
= 0.011503557379296881
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
n = 10       # sample size
p = 0.05     # probability that an item is defective
k = np.arange(0, n + 1)              # possible numbers of defective items: 0..10
binomial = stats.binom.pmf(k, n, p)  # P(X = k) for each k
print("Question 2 : ANSWERS: \n")
print(binomial)
# Probability that none of the items are defective?
print(f"\nA: Probability that none of the items are defective : {binomial[0] } \n")
#--------------------------------------------------------------------------------------------------
# Probability that exactly one of the items is defective?
print(f"B: Probability that exactly one of the items is defective : {binomial[1] } \n")
#--------------------------------------------------------------------------------------------------
# Probability that two or fewer of the items are defective?
"""
P("two or fewer of the items are defective" ) = P(No defective) + P(1 defective ) + P(2 defective)
= binomial[0] + binomial[1] + binomial[2]
"""
probability_two_or_fewer_defective = binomial[0] + binomial[1] + binomial[2]
print(f"C: Probability that two or fewer of the items are defective? : {probability_two_or_fewer_defective } \n")
#--------------------------------------------------------------------------------------------------
# Probability that three or more of the items are defective ?
"""
P("three or more of the items are defective" ) = P(3 defective) + ..... +P(10 defective )
= binomial[3] + binomial[4] + binomial[5] + binomial[6]
+ binomial[7] + binomial[8] + binomial[9] + binomial[10]
OR
= 1-(probability_two_or_fewer_defective)
"""
probability_three_or_more_defective = binomial[3] + binomial[4] + binomial[5] + binomial[6] + binomial[7] + binomial[8] + binomial[9] + binomial[10]
print(f"D: Probability that three or more of the items are defective ? : {probability_three_or_more_defective } \n")
print(f"\n OR \n Probability that three or more of the items are defective = 1-(probability_two_or_fewer_defective) :{1-probability_two_or_fewer_defective}")
#--------------------------------------------------------------------------------------------------
Question 2 : ANSWERS:

[5.98736939e-01 3.15124705e-01 7.46347985e-02 1.04750594e-02 9.64808106e-04
 6.09352488e-05 2.67259863e-06 8.03789063e-08 1.58642578e-09 1.85546875e-11
 9.76562500e-14]

A: Probability that none of the items are defective : 0.5987369392383789

B: Probability that exactly one of the items is defective : 0.31512470486230504

C: Probability that two or fewer of the items are defective? : 0.9884964426207035

D: Probability that three or more of the items are defective ? : 0.011503557379296881

 OR
 Probability that three or more of the items are defective = 1-(probability_two_or_fewer_defective) :0.011503557379296536
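As a cross-check on parts C and D, `scipy.stats.binom` also exposes the CDF and survival function directly, which avoids summing pmf terms by hand; a short sketch:

```python
from scipy import stats

n, p = 10, 0.05  # sample size and defect probability from the question

# C. P(X <= 2) directly from the binomial CDF, instead of summing pmf terms.
p_two_or_fewer = stats.binom.cdf(2, n, p)

# D. P(X >= 3) via the survival function: sf(k) = P(X > k), so sf(2) = P(X >= 3).
p_three_or_more = stats.binom.sf(2, n, p)

print(p_two_or_fewer, p_three_or_more)
```

Both values agree with the sums of `binomial[...]` terms computed above.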
Question: A car salesman sells on an average 3 cars per week.
A. Probability that in a given week he will sell some cars.
B. Probability that in a given week he will sell 2 or more but less than 5 cars.
C. Plot the poisson distribution function for cumulative probability of cars sold per-week vs number of cars sold per-week.
rate = 3                              # average cars sold per week
n = np.arange(0, 10)                  # numbers of cars sold: 0..9
poisson = stats.poisson.pmf(n, rate)  # P(X = n) for each n
print("Question 3 : ANSWERS: \n")
print(poisson)
#--------------------------------------------------------------------------------------------------
# A. Probability that in a given week he will sell some cars.
"""
P(sell some cars in week) = 1-P(no cars sell in week)
= 1- poisson[0]
= 1- 0.04978707
= 0.9502129
"""
print(f"\n A : Probability that in a given week he will sell some cars : {1- poisson[0] } \n")
#--------------------------------------------------------------------------------------------------
# B. Probability that in a given week he will sell 2 or more but less than 5 cars.
"""
P(sell 2 or more but less than 5 cars) = P(sell 2 cars) + P(sell 3 cars) + P(sell 4 cars)
= poisson[2] + poisson[3] + poisson[4]
= 0.22404181 + 0.22404181 + 0.16803136
= 0.61611498
"""
print(f"\n B : Probability that in a given week he will sell 2 or more but less than 5 cars. : {poisson[2] + poisson[3] + poisson[4] } \n")
#--------------------------------------------------------------------------------------------------
# C. Plot the poisson distribution function for cumulative probability of cars sold per-week vs number of cars sold per-week.
print(f"\n C : Plot the poisson distribution function for cumulative probability of cars sold per-week vs number of cars sold per-week. \n")
poisson_cdf = stats.poisson.cdf(n, rate)  # cumulative probability P(X <= n), as asked in part C
plt.plot(n, poisson_cdf, 'o-')
plt.title(r'Poisson: $\lambda$ = %i' % rate)
plt.xlabel('Number of cars sold per-week')
plt.ylabel('Cumulative probability of cars sold per-week')
plt.show()
#--------------------------------------------------------------------------------------------------
Question 3 : ANSWERS:

[0.04978707 0.14936121 0.22404181 0.22404181 0.16803136 0.10081881
 0.05040941 0.02160403 0.00810151 0.0027005 ]

 A : Probability that in a given week he will sell some cars : 0.950212931632136

 B : Probability that in a given week he will sell 2 or more but less than 5 cars. : 0.6161149710523164

 C : Plot the poisson distribution function for cumulative probability of cars sold per-week vs number of cars sold per-week.
Question: Accuracy in understanding orders for a speech based bot at a restaurant is important for the Company X which has designed, marketed and launched the product for a contactless delivery due to the COVID-19 pandemic. Recognition accuracy that measures the percentage of orders that are taken correctly is 86.8%. Suppose that you place order with the bot and two friends of yours independently place orders with the same bot. Answer the following questions.
A. What is the probability that all three orders will be recognised correctly?
B. What is the probability that none of the three orders will be recognised correctly?
C. What is the probability that at least two of the three orders will be recognised correctly?
A. What is the probability that all three orders will be recognised correctly?
P(all three orders will be recognised correctly) = 3C3 *(P(Correct))^3 * (1-P(Correct))^0
= 1 * (0.868)^3 * 1
= 0.653972032
B. What is the probability that none of the three orders will be recognised correctly?
P(none of the three orders will be recognised correctly) = 3C0 *P(Correct)^0 * (1-P(Correct))^3
= 1* 1 * (1- 0.868)^3
= 0.0022999680000000003
C. What is the probability that at least two of the three orders will be recognised correctly?
P(at least two of the three orders will be recognised correctly)
= P(2 order recognised correctly) + P(3 order recognised correctly)
= 3C2 *P(Correct)^2 * (1-P(Correct))^1 + 3C3 *P(Correct)^3 * (1-P(Correct))^0
= 3 * 0.868 ^ 2 * (1-0.868 )^1 + 1 * 0.868 ^ 3 * (1-0.868 )^0
= 3 * 0.868 ^ 2 * (0.132)^1 + 1 * 0.868 ^ 3 * 1
= 0.952327936
size = 3                  # number of orders placed
p_correct = 0.868         # recognition accuracy
orders = np.arange(0, size + 1)
bin_order = stats.binom.pmf(orders, size, p_correct)
print("Question 4 : ANSWERS: \n")
print(bin_order)
#--------------------------------------------------------------------------------------------------
# A. What is the probability that all three orders will be recognised correctly?
print(f"\nA: Probability that all three orders will be recognised correctly : {bin_order[3] } \n")
#--------------------------------------------------------------------------------------------------
# B. What is the probability that none of the three orders will be recognised correctly?
print(f"\nB: Probability that none of the three orders will be recognised correctly : {bin_order[0] } \n")
#--------------------------------------------------------------------------------------------------
# C. What is the probability that at least two of the three orders will be recognised correctly?
print(f"\nC: Probability that at least two of the three orders will be recognised correctly : {bin_order[2] + bin_order[3]} \n")
Question 4 : ANSWERS:

[0.00229997 0.0453721  0.2983559  0.65397203]

A: Probability that all three orders will be recognised correctly : 0.653972032

B: Probability that none of the three orders will be recognised correctly : 0.002299968

C: Probability that at least two of the three orders will be recognised correctly : 0.9523279359999999
Question: In a test taken by 300 students, scores are normally distributed with mean 60 and standard deviation 12. Answer the following questions.
A. What is the percentage of students who score more than 80?
B. What is the percentage of students who score less than 50?
C. What should be the distinction mark if the highest 10% of students are to be awarded distinction?
normal_n = 300
mean = 60
std = 12
print("Question 5 : ANSWERS: \n")
#--------------------------------------------------------------------------------------------------
# A. What is the percentage of students who score more than 80.
z_more_than_80 = (80 - 60) / 12                    # z-score for a mark of 80
p_score_up_to_80 = stats.norm.cdf(z_more_than_80)  # P(Z <= z), i.e. proportion scoring 80 or less
print("A: percentage of students who score more than 80 by z score calculation = ")
print((1 - p_score_up_to_80) * 100)
print(f"\n percentage of students who score more than 80 : {(1 - stats.norm.cdf(80,loc=60,scale=12)) *100 } % \n")
#--------------------------------------------------------------------------------------------------
# B. What is the percentage of students who score less than 50.
z_less_than_50 = (50 - 60) / 12                     # z-score for a mark of 50
p_score_below_50 = stats.norm.cdf(z_less_than_50)   # P(Z <= z)
print("B: percentage of students who score less than 50 = ")
print(p_score_below_50 * 100)
print(f"\n percentage of students who score less than 50 : {(stats.norm.cdf(50,loc=60,scale=12)) *100 } % \n")
#--------------------------------------------------------------------------------------------------
# C. What should be the distinction mark if the highest 10% of students are to be awarded distinction?
z_90_percent = stats.norm.ppf(0.90)  # z value with 90% of the area to its left
x = z_90_percent * std + mean        # convert back to the mark scale
print(f"C: Distinction mark if the highest 10% of students are to be awarded distinction is : {x}")
print("\n\nVerification : ")
print(f"percentage at value > x : {1- stats.norm.cdf(x,loc=60,scale=12)}")
Question 5 : ANSWERS:

A: percentage of students who score more than 80 by z score calculation =
4.77903522728147

 percentage of students who score more than 80 : 4.77903522728147 %

B: percentage of students who score less than 50 =
20.232838096364308

 percentage of students who score less than 50 : 20.232838096364308 %

C: Distinction mark if the highest 10% of students are to be awarded distinction is : 75.3786187865352

Verification :
percentage at value > x : 0.10000000000000009
ANSWER:
Detecting tumors in brain scans. Using brain-scan data from past patients, we can analyse the data to predict the location and shape of tumors. From this analysis we can determine the relationships between variables such as life expectancy, recommended type of treatment, size of tumor, location of tumor, grade of tumor, and other details. This helps in constructing hypotheses and predicting probable outcomes for healthcare providers to focus on, and aids preliminary diagnosis.
Companies use statistics in market research and new product development. We can take random surveys of consumers to gauge market acceptance of, and potential for, a proposed product. We can then show executives whether there will be enough demand for the product: is there enough demand to justify spending money to develop it and, ultimately, to build a plant to produce it? From the statistical analysis, a break-even model is constructed to determine the volume of sales necessary for the product to succeed.
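As an illustration of the break-even model mentioned above, here is a minimal sketch; the fixed cost, price, and variable cost figures are invented assumptions, not from the text:

```python
# Hypothetical break-even sketch: all figures are illustrative assumptions.
# Break-even volume = fixed cost / (price - variable cost per unit).
fixed_cost = 500_000         # assumed development + plant cost
price_per_unit = 50          # assumed selling price
variable_cost_per_unit = 30  # assumed cost to produce one unit

contribution_margin = price_per_unit - variable_cost_per_unit  # profit per unit sold
break_even_volume = fixed_cost / contribution_margin           # units needed to cover fixed cost

print(break_even_volume)  # 25000.0 units
```

If the survey-based demand forecast exceeds this volume, the product is worth developing under these assumptions.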
• DOMAIN: Sports
• CONTEXT: Company X manages the men's top professional basketball division of the American league system. The dataset contains information on all the teams that have participated in all the past tournaments. It has data about how many baskets each team scored, conceded, how many times they came within the first 2 positions, how many tournaments they have qualified, their best position in the past, etc.
• DATA DESCRIPTION: Basketball.csv - The dataset contains information on all the teams that have participated in the past tournaments.
• ATTRIBUTE INFORMATION:
1. Team: Team’s name
2. Tournament: Number of played tournaments.
3. Score: Team’s score so far.
4. PlayedGames: Games played by the team so far.
5. WonGames: Games won by the team so far.
6. DrawnGames: Games drawn by the team so far.
7. LostGames: Games lost by the team so far.
8. BasketScored: Basket scored by the team so far.
9. BasketGiven: Basket scored against the team so far.
10. TournamentChampion: How many times the team was a champion of the tournaments so far.
11. Runner-up: How many times the team was a runners-up of the tournaments so far.
12. TeamLaunch: Year the team was launched into professional basketball.
13. HighestPositionHeld: Highest position held by the team amongst all the tournaments played.
• PROJECT OBJECTIVE: Company’s management wants to invest on proposal on managing some of the best teams in the league. The analytics department has been assigned with a task of creating a report on the performance shown by the teams. Some of the older teams are already in contract with competitors. Hence Company X wants to understand which teams they can approach which will be a deal win for them.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True) # adds a nice background to the graphs
%matplotlib inline
Data = pd.read_csv('DS - Part2 - Basketball.csv') # Import the basketball dataset
Data.head() # view the first 5 rows of the data
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 |
| 1 | Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 |
| 2 | Team 3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 |
| 3 | Team 4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931to32 | 1 |
| 4 | Team 5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 |
Here we can see that there are 13 columns.
Team is a categorical column with unique values.
The rest of the columns are numerical.
TeamLaunch is temporal data containing a year or a duration.
Data.shape # see the shape of the data
(61, 13)
The dataset has 61 rows and 13 columns
Data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Team                 61 non-null     object
 1   Tournament           61 non-null     int64
 2   Score                61 non-null     object
 3   PlayedGames          61 non-null     object
 4   WonGames             61 non-null     object
 5   DrawnGames           61 non-null     object
 6   LostGames            61 non-null     object
 7   BasketScored         61 non-null     object
 8   BasketGiven          61 non-null     object
 9   TournamentChampion   61 non-null     object
 10  Runner-up            61 non-null     object
 11  TeamLaunch           61 non-null     object
 12  HighestPositionHeld  61 non-null     int64
dtypes: int64(2), object(11)
memory usage: 6.3+ KB
As we can see, most of the columns have object dtype even though their values are numerical. We therefore have to convert these columns to numeric types so that we can process them further.
**This will be done at a later point.**
Data.isnull().sum().sum()
0
No Null Data Present
dupes = Data.duplicated()
sum(dupes)
0
No duplicate data found
Data.isnull().values.any() # Any of the values in the dataframe is a missing value
False
No Null Data Present
Data.tail()
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 56 | Team 57 | 1 | 34 | 38 | 8 | 10 | 20 | 38 | 66 | - | - | 2009-10 | 20 |
| 57 | Team 58 | 1 | 22 | 30 | 7 | 8 | 15 | 37 | 57 | - | - | 1956-57 | 16 |
| 58 | Team 59 | 1 | 19 | 30 | 7 | 5 | 18 | 51 | 85 | - | - | 1951~52 | 16 |
| 59 | Team 60 | 1 | 14 | 30 | 5 | 4 | 21 | 34 | 65 | - | - | 1955-56 | 15 |
| 60 | Team 61 | 1 | - | - | - | - | - | - | - | - | - | 2017~18 | 9 |
Data.TournamentChampion.value_counts()
-     52
1      3
2      1
25     1
33     1
8      1
10     1
6      1
Name: TournamentChampion, dtype: int64
Data['Runner-up'].value_counts()
-     48
1      5
23     1
6      1
25     1
3      1
5      1
8      1
7      1
4      1
Name: Runner-up, dtype: int64
Since most entries of TournamentChampion (52 '-') and Runner-up (48 '-') are '-', we will replace them with 0, meaning those teams have never been champion or runner-up. Replacing with the average, or removing those rows, would not be appropriate here.
If we removed the rows containing '-', we would have only about 9 rows left, which is too small a sample to analyse.
Hence we replace the '-' entries with 0 (assuming those teams have not played those matches), which also makes the data types uniform.
cols_with_dash = ['Score', 'PlayedGames', 'WonGames', 'DrawnGames', 'LostGames',
                  'BasketScored', 'BasketGiven', 'TournamentChampion', 'Runner-up']
for col in cols_with_dash:
    Data[col] = Data[col].replace('-', 0)  # '-' means the team never achieved this; treat as 0
Data.tail()
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 56 | Team 57 | 1 | 34 | 38 | 8 | 10 | 20 | 38 | 66 | 0 | 0 | 2009-10 | 20 |
| 57 | Team 58 | 1 | 22 | 30 | 7 | 8 | 15 | 37 | 57 | 0 | 0 | 1956-57 | 16 |
| 58 | Team 59 | 1 | 19 | 30 | 7 | 5 | 18 | 51 | 85 | 0 | 0 | 1951~52 | 16 |
| 59 | Team 60 | 1 | 14 | 30 | 5 | 4 | 21 | 34 | 65 | 0 | 0 | 1955-56 | 15 |
| 60 | Team 61 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017~18 | 9 |
Since the TeamLaunch column is not uniform and contains either a year or a duration, we create a new column capturing the starting year by parsing the values in the column.
# The launch value always begins with the four-digit starting year, optionally
# followed by a separator ('-', '~', 'to' or '_') and a second year.
Data["TeamLaunchStartYear"] = Data["TeamLaunch"].str.extract(r'^(\d{4})')
Data
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | TeamLaunchStartYear | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 | 1929 |
| 1 | Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 | 1929 |
| 2 | Team 3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 | 1929 |
| 3 | Team 4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931to32 | 1 | 1931 |
| 4 | Team 5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 | 1929 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 56 | Team 57 | 1 | 34 | 38 | 8 | 10 | 20 | 38 | 66 | 0 | 0 | 2009-10 | 20 | 2009 |
| 57 | Team 58 | 1 | 22 | 30 | 7 | 8 | 15 | 37 | 57 | 0 | 0 | 1956-57 | 16 | 1956 |
| 58 | Team 59 | 1 | 19 | 30 | 7 | 5 | 18 | 51 | 85 | 0 | 0 | 1951~52 | 16 | 1951 |
| 59 | Team 60 | 1 | 14 | 30 | 5 | 4 | 21 | 34 | 65 | 0 | 0 | 1955-56 | 15 | 1955 |
| 60 | Team 61 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017~18 | 9 | 2017 |
61 rows × 14 columns
Convert the object columns to numeric so that numerical functions can be applied.
Data.columns
Data[['Tournament', 'Score', 'PlayedGames', 'WonGames', 'DrawnGames', 'LostGames', 'BasketScored', 'BasketGiven', 'TournamentChampion',
'Runner-up', 'HighestPositionHeld','TeamLaunchStartYear']] = Data[['Tournament', 'Score', 'PlayedGames', 'WonGames', 'DrawnGames',
'LostGames', 'BasketScored', 'BasketGiven', 'TournamentChampion','Runner-up', 'HighestPositionHeld',
'TeamLaunchStartYear']].apply(pd.to_numeric)
Data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Team                 61 non-null     object
 1   Tournament           61 non-null     int64
 2   Score                61 non-null     int64
 3   PlayedGames          61 non-null     int64
 4   WonGames             61 non-null     int64
 5   DrawnGames           61 non-null     int64
 6   LostGames            61 non-null     int64
 7   BasketScored         61 non-null     int64
 8   BasketGiven          61 non-null     int64
 9   TournamentChampion   61 non-null     int64
 10  Runner-up            61 non-null     int64
 11  TeamLaunch           61 non-null     object
 12  HighestPositionHeld  61 non-null     int64
 13  TeamLaunchStartYear  61 non-null     int64
dtypes: int64(12), object(2)
memory usage: 6.8+ KB
Insert the columns below for better analysis:
1. Percentage_won = WonGames / PlayedGames * 100
2. Percentage_BasketScored = BasketScored / (BasketScored + BasketGiven) * 100
3. Finalist = TournamentChampion + Runner-up
4. percentage_Finalist = (TournamentChampion + Runner-up) / Tournament * 100
5. percentage_TournamentChampion = TournamentChampion / Tournament * 100
Data["Percentage_won"] = (Data["WonGames"]/Data["PlayedGames"])* 100
Data["Percentage_BasketScored"] = (Data["BasketScored"]/(Data["BasketScored"] + Data["BasketGiven"]))* 100
Data["Finalist"] = (Data["TournamentChampion"]+Data["Runner-up"])
Data["percentage_Finalist"] = ((Data["TournamentChampion"] + Data["Runner-up"])/Data["Tournament"])* 100
Data["percentage_TournamentChampion"] = (Data["TournamentChampion"]/Data["Tournament"])* 100
Data
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | TeamLaunchStartYear | Percentage_won | Percentage_BasketScored | Finalist | percentage_Finalist | percentage_TournamentChampion | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 | 1929 | 59.630702 | 65.445141 | 56 | 65.116279 | 38.372093 |
| 1 | Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 | 1929 | 57.241130 | 65.453739 | 50 | 58.139535 | 29.069767 |
| 2 | Team 3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 | 1929 | 47.475134 | 57.809512 | 18 | 22.500000 | 12.500000 |
| 3 | Team 4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931to32 | 1 | 1931 | 44.557057 | 55.904411 | 12 | 14.634146 | 7.317073 |
| 4 | Team 5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 | 1929 | 43.772629 | 55.587565 | 15 | 17.441860 | 9.302326 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 56 | Team 57 | 1 | 34 | 38 | 8 | 10 | 20 | 38 | 66 | 0 | 0 | 2009-10 | 20 | 2009 | 21.052632 | 36.538462 | 0 | 0.000000 | 0.000000 |
| 57 | Team 58 | 1 | 22 | 30 | 7 | 8 | 15 | 37 | 57 | 0 | 0 | 1956-57 | 16 | 1956 | 23.333333 | 39.361702 | 0 | 0.000000 | 0.000000 |
| 58 | Team 59 | 1 | 19 | 30 | 7 | 5 | 18 | 51 | 85 | 0 | 0 | 1951~52 | 16 | 1951 | 23.333333 | 37.500000 | 0 | 0.000000 | 0.000000 |
| 59 | Team 60 | 1 | 14 | 30 | 5 | 4 | 21 | 34 | 65 | 0 | 0 | 1955-56 | 15 | 1955 | 16.666667 | 34.343434 | 0 | 0.000000 | 0.000000 |
| 60 | Team 61 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2017~18 | 9 | 2017 | NaN | NaN | 0 | 0.000000 | 0.000000 |
61 rows × 19 columns
Data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 19 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   Team                           61 non-null     object
 1   Tournament                     61 non-null     int64
 2   Score                          61 non-null     int64
 3   PlayedGames                    61 non-null     int64
 4   WonGames                       61 non-null     int64
 5   DrawnGames                     61 non-null     int64
 6   LostGames                      61 non-null     int64
 7   BasketScored                   61 non-null     int64
 8   BasketGiven                    61 non-null     int64
 9   TournamentChampion             61 non-null     int64
 10  Runner-up                      61 non-null     int64
 11  TeamLaunch                     61 non-null     object
 12  HighestPositionHeld            61 non-null     int64
 13  TeamLaunchStartYear            61 non-null     int64
 14  Percentage_won                 60 non-null     float64
 15  Percentage_BasketScored        60 non-null     float64
 16  Finalist                       61 non-null     int64
 17  percentage_Finalist            61 non-null     float64
 18  percentage_TournamentChampion  61 non-null     float64
dtypes: float64(4), int64(13), object(2)
memory usage: 9.2+ KB
Data.describe()
| Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | HighestPositionHeld | TeamLaunchStartYear | Percentage_won | Percentage_BasketScored | Finalist | percentage_Finalist | percentage_TournamentChampion | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 61.000000 | 61.000000 | 61.000000 | 61.000000 | 61.000000 | 61.000000 | 61.000000 | 61.000000 | 61.000000 | 61.000000 | 61.000000 | 61.000000 | 60.000000 | 60.000000 | 61.000000 | 61.000000 | 61.000000 |
| mean | 24.000000 | 901.426230 | 796.819672 | 303.967213 | 188.934426 | 303.754098 | 1140.344262 | 1140.229508 | 1.426230 | 1.409836 | 7.081967 | 1958.918033 | 31.364790 | 43.610198 | 2.836066 | 3.645135 | 1.720841 |
| std | 26.827225 | 1134.899121 | 876.282765 | 406.991030 | 201.799477 | 294.708594 | 1506.740211 | 1163.710766 | 5.472535 | 4.540107 | 5.276663 | 27.484114 | 7.831199 | 6.777215 | 9.941798 | 11.670006 | 6.392672 |
| min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1929.000000 | 16.666667 | 27.777778 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 4.000000 | 96.000000 | 114.000000 | 34.000000 | 24.000000 | 62.000000 | 153.000000 | 221.000000 | 0.000000 | 0.000000 | 3.000000 | 1935.000000 | 27.607494 | 39.742084 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 12.000000 | 375.000000 | 423.000000 | 123.000000 | 95.000000 | 197.000000 | 430.000000 | 632.000000 | 0.000000 | 0.000000 | 6.000000 | 1951.000000 | 30.491722 | 42.399042 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 38.000000 | 1351.000000 | 1318.000000 | 426.000000 | 330.000000 | 563.000000 | 1642.000000 | 1951.000000 | 0.000000 | 0.000000 | 10.000000 | 1978.000000 | 33.540164 | 45.493620 | 0.000000 | 0.000000 | 0.000000 |
| max | 86.000000 | 4385.000000 | 2762.000000 | 1647.000000 | 633.000000 | 1070.000000 | 5947.000000 | 3889.000000 | 33.000000 | 25.000000 | 20.000000 | 2017.000000 | 59.630702 | 65.453739 | 56.000000 | 65.116279 | 38.372093 |
print(f"Mean: \n{Data.mean()}")
Mean:
Tournament                         24.000000
Score                             901.426230
PlayedGames                       796.819672
WonGames                          303.967213
DrawnGames                        188.934426
LostGames                         303.754098
BasketScored                     1140.344262
BasketGiven                      1140.229508
TournamentChampion                  1.426230
Runner-up                           1.409836
HighestPositionHeld                 7.081967
TeamLaunchStartYear              1958.918033
Percentage_won                     31.364790
Percentage_BasketScored            43.610198
Finalist                            2.836066
percentage_Finalist                 3.645135
percentage_TournamentChampion       1.720841
dtype: float64
print(f"Median:\n{Data.median()}")
Median:
Tournament                         12.000000
Score                             375.000000
PlayedGames                       423.000000
WonGames                          123.000000
DrawnGames                         95.000000
LostGames                         197.000000
BasketScored                      430.000000
BasketGiven                       632.000000
TournamentChampion                  0.000000
Runner-up                           0.000000
HighestPositionHeld                 6.000000
TeamLaunchStartYear              1951.000000
Percentage_won                     30.491722
Percentage_BasketScored            42.399042
Finalist                            0.000000
percentage_Finalist                 0.000000
percentage_TournamentChampion       0.000000
dtype: float64
Team is the only categorical column and all its values are unique, so the mode is not informative here.
print("Data_quantile(25%):\n",Data.quantile(q=0.25))
Data_quantile(25%):
Tournament                          4.000000
Score                              96.000000
PlayedGames                       114.000000
WonGames                           34.000000
DrawnGames                         24.000000
LostGames                          62.000000
BasketScored                      153.000000
BasketGiven                       221.000000
TournamentChampion                  0.000000
Runner-up                           0.000000
HighestPositionHeld                 3.000000
TeamLaunchStartYear              1935.000000
Percentage_won                     27.607494
Percentage_BasketScored            39.742084
Finalist                            0.000000
percentage_Finalist                 0.000000
percentage_TournamentChampion       0.000000
Name: 0.25, dtype: float64
print("Data_quantile(75%):\n",Data.quantile(q=0.75))
Data_quantile(75%):
Tournament                         38.000000
Score                            1351.000000
PlayedGames                      1318.000000
WonGames                          426.000000
DrawnGames                        330.000000
LostGames                         563.000000
BasketScored                     1642.000000
BasketGiven                      1951.000000
TournamentChampion                  0.000000
Runner-up                           0.000000
HighestPositionHeld                10.000000
TeamLaunchStartYear              1978.000000
Percentage_won                     33.540164
Percentage_BasketScored            45.493620
Finalist                            0.000000
percentage_Finalist                 0.000000
percentage_TournamentChampion       0.000000
Name: 0.75, dtype: float64
sns.boxplot(x=Data["Score"])
<AxesSubplot:xlabel='Score'>
It can be observed that there are outliers in this case.
Data.quantile(0.75) - Data.quantile(0.25)  # interquartile range (IQR) of each column
Tournament                         34.000000
Score                            1255.000000
PlayedGames                      1204.000000
WonGames                          392.000000
DrawnGames                        306.000000
LostGames                         501.000000
BasketScored                     1489.000000
BasketGiven                      1730.000000
TournamentChampion                  0.000000
Runner-up                           0.000000
HighestPositionHeld                 7.000000
TeamLaunchStartYear                43.000000
Percentage_won                      5.932670
Percentage_BasketScored             5.751537
Finalist                            0.000000
percentage_Finalist                 0.000000
percentage_TournamentChampion       0.000000
dtype: float64
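The interquartile ranges above can be turned into the usual 1.5×IQR outlier fences; a sketch for the Score column, using its quartiles (Q1 = 96, Q3 = 1351) from the `describe()` output and a few Score values copied from the table:

```python
# 1.5*IQR rule applied to the Score column, with quartiles from describe() above.
q1, q3 = 96.0, 1351.0
iqr = q3 - q1                 # 1255.0
lower_fence = q1 - 1.5 * iqr  # -1786.5 (no low-side outliers possible: scores >= 0)
upper_fence = q3 + 1.5 * iqr  # 3233.5

scores = [4385, 4262, 3442, 3386, 3368, 375]  # a few Score values from the table
outliers = [s for s in scores if s < lower_fence or s > upper_fence]
print(outliers)  # the top teams' scores all exceed the upper fence
```

This matches the boxplot: the high-side points beyond the upper whisker are the dominant teams' scores.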
data_values = Data.drop(['Team', 'TeamLaunch'], axis=1)  # keep only numeric columns
print(data_values.max() - data_values.min())             # range of each column
Tournament                         85.000000
Score                            4385.000000
PlayedGames                      2762.000000
WonGames                         1647.000000
DrawnGames                        633.000000
LostGames                        1070.000000
BasketScored                     5947.000000
BasketGiven                      3889.000000
TournamentChampion                 33.000000
Runner-up                          25.000000
HighestPositionHeld                19.000000
TeamLaunchStartYear                88.000000
Percentage_won                     42.964036
Percentage_BasketScored            37.675961
Finalist                           56.000000
percentage_Finalist                65.116279
percentage_TournamentChampion      38.372093
dtype: float64
print(Data.var())
Tournament 7.197000e+02 Score 1.287996e+06 PlayedGames 7.678715e+05 WonGames 1.656417e+05 DrawnGames 4.072303e+04 LostGames 8.685316e+04 BasketScored 2.270266e+06 BasketGiven 1.354223e+06 TournamentChampion 2.994863e+01 Runner-up 2.061257e+01 HighestPositionHeld 2.784317e+01 TeamLaunchStartYear 7.553765e+02 Percentage_won 6.132769e+01 Percentage_BasketScored 4.593064e+01 Finalist 9.883934e+01 percentage_Finalist 1.361890e+02 percentage_TournamentChampion 4.086626e+01 dtype: float64
print(Data.std())
Tournament 26.827225 Score 1134.899121 PlayedGames 876.282765 WonGames 406.991030 DrawnGames 201.799477 LostGames 294.708594 BasketScored 1506.740211 BasketGiven 1163.710766 TournamentChampion 5.472535 Runner-up 4.540107 HighestPositionHeld 5.276663 TeamLaunchStartYear 27.484114 Percentage_won 7.831199 Percentage_BasketScored 6.777215 Finalist 9.941798 percentage_Finalist 11.670006 percentage_TournamentChampion 6.392672 dtype: float64
Data.cov()
| Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | HighestPositionHeld | TeamLaunchStartYear | Percentage_won | Percentage_BasketScored | Finalist | percentage_Finalist | percentage_TournamentChampion | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tournament | 719.700000 | 2.988113e+04 | 2.347713e+04 | 10612.216667 | 5356.266667 | 7509.816667 | 3.941348e+04 | 3.083912e+04 | 86.483333 | 78.666667 | -100.233333 | -444.716667 | 172.548918 | 150.699173 | 165.150000 | 199.392048 | 103.011491 |
| Score | 29881.133333 | 1.287996e+06 | 9.744279e+05 | 460619.130874 | 219506.661749 | 294342.006557 | 1.704281e+06 | 1.247083e+06 | 4436.231967 | 3937.989071 | -4010.935519 | -16950.564481 | 7821.523453 | 6814.073405 | 8374.221038 | 10051.332600 | 5260.037711 |
| PlayedGames | 23477.133333 | 9.744279e+05 | 7.678715e+05 | 345098.710656 | 175781.754645 | 247015.154918 | 1.280888e+06 | 1.009673e+06 | 2756.044809 | 2518.025137 | -3286.984973 | -14148.081694 | 5587.847758 | 4904.902488 | 5274.069945 | 6385.627893 | 3285.932647 |
| WonGames | 10612.216667 | 4.606191e+05 | 3.450987e+05 | 165641.698907 | 77189.964481 | 102286.208470 | 6.128115e+05 | 4.386830e+05 | 1675.364208 | 1473.330328 | -1392.963934 | -6083.436066 | 2828.717303 | 2448.316275 | 3148.694536 | 3760.163214 | 1982.646239 |
| DrawnGames | 5356.266667 | 2.195067e+05 | 1.757818e+05 | 77189.964481 | 40723.028962 | 57875.583607 | 2.866269e+05 | 2.330936e+05 | 556.011749 | 518.810656 | -766.511202 | -3203.622131 | 1224.962602 | 1091.152204 | 1074.822404 | 1314.123563 | 666.436653 |
| LostGames | 7509.816667 | 2.943420e+05 | 2.470152e+05 | 102286.208470 | 57875.583607 | 86853.155191 | 3.815134e+05 | 3.379055e+05 | 524.906557 | 526.119126 | -1127.662842 | -4864.037158 | 1534.570745 | 1365.855138 | 1051.025683 | 1311.948638 | 637.136562 |
| BasketScored | 39413.483333 | 1.704281e+06 | 1.280888e+06 | 612811.511475 | 286626.939617 | 381513.436066 | 2.270266e+06 | 1.633385e+06 | 6127.734153 | 5407.839891 | -5163.528689 | -22919.254645 | 10409.944547 | 9020.750868 | 11535.574044 | 13765.597721 | 7249.982029 |
| BasketGiven | 30839.116667 | 1.247083e+06 | 1.009673e+06 | 438683.007650 | 233093.615301 | 337905.490710 | 1.633385e+06 | 1.354223e+06 | 3004.783880 | 2820.337705 | -4461.752459 | -19692.747541 | 6902.673882 | 6051.576632 | 5825.121585 | 7119.868702 | 3601.365568 |
| TournamentChampion | 86.483333 | 4.436232e+03 | 2.756045e+03 | 1675.364208 | 556.011749 | 524.906557 | 6.127734e+03 | 3.004784e+03 | 29.948634 | 24.139071 | -8.818852 | -42.847814 | 33.058040 | 27.239488 | 54.087705 | 62.899463 | 34.964972 |
| Runner-up | 78.666667 | 3.937989e+03 | 2.518025e+03 | 1473.330328 | 518.810656 | 526.119126 | 5.407840e+03 | 2.820338e+03 | 24.139071 | 20.612568 | -8.634153 | -39.415847 | 28.704845 | 23.941503 | 44.751639 | 52.638445 | 28.258179 |
| HighestPositionHeld | -100.233333 | -4.010936e+03 | -3.286985e+03 | -1392.963934 | -766.511202 | -1127.662842 | -5.163529e+03 | -4.461752e+03 | -8.818852 | -8.634153 | 27.843169 | 85.406831 | -30.689361 | -25.763672 | -17.453005 | -22.285164 | -10.640531 |
| TeamLaunchStartYear | -444.716667 | -1.695056e+04 | -1.414808e+04 | -6083.436066 | -3203.622131 | -4864.037158 | -2.291925e+04 | -1.969275e+04 | -42.847814 | -39.415847 | 85.406831 | 755.376503 | -102.513872 | -71.174010 | -82.263661 | -98.304990 | -51.441691 |
| Percentage_won | 172.548918 | 7.821523e+03 | 5.587848e+03 | 2828.717303 | 1224.962602 | 1534.570745 | 1.040994e+04 | 6.902674e+03 | 33.058040 | 28.704845 | -30.689361 | -102.513872 | 61.327685 | 50.399847 | 61.762885 | 74.256747 | 38.991248 |
| Percentage_BasketScored | 150.699173 | 6.814073e+03 | 4.904902e+03 | 2448.316275 | 1091.152204 | 1365.855138 | 9.020751e+03 | 6.051577e+03 | 27.239488 | 23.941503 | -25.763672 | -71.174010 | 50.399847 | 45.930644 | 51.180991 | 61.709116 | 32.166062 |
| Finalist | 165.150000 | 8.374221e+03 | 5.274070e+03 | 3148.694536 | 1074.822404 | 1051.025683 | 1.153557e+04 | 5.825122e+03 | 54.087705 | 44.751639 | -17.453005 | -82.263661 | 61.762885 | 51.180991 | 98.839344 | 115.537908 | 63.223151 |
| percentage_Finalist | 199.392048 | 1.005133e+04 | 6.385628e+03 | 3760.163214 | 1314.123563 | 1311.948638 | 1.376560e+04 | 7.119869e+03 | 62.899463 | 52.638445 | -22.285164 | -98.304990 | 74.256747 | 61.709116 | 115.537908 | 136.189029 | 73.666006 |
| percentage_TournamentChampion | 103.011491 | 5.260038e+03 | 3.285933e+03 | 1982.646239 | 666.436653 | 637.136562 | 7.249982e+03 | 3.601366e+03 | 34.964972 | 28.258179 | -10.640531 | -51.441691 | 38.991248 | 32.166062 | 63.223151 | 73.666006 | 40.866261 |
Data.corr()
| Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | HighestPositionHeld | TeamLaunchStartYear | Percentage_won | Percentage_BasketScored | Finalist | percentage_Finalist | percentage_TournamentChampion | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tournament | 1.000000 | 0.981441 | 0.998677 | 0.971954 | 0.989387 | 0.949863 | 0.975059 | 0.987828 | 0.589072 | 0.645876 | -0.708071 | -0.603151 | 0.819559 | 0.827096 | 0.619210 | 0.636885 | 0.600658 |
| Score | 0.981441 | 1.000000 | 0.979824 | 0.997240 | 0.958452 | 0.880040 | 0.996656 | 0.944263 | 0.714280 | 0.764278 | -0.669775 | -0.543432 | 0.877385 | 0.883248 | 0.742202 | 0.758919 | 0.725019 |
| PlayedGames | 0.998677 | 0.979824 | 1.000000 | 0.967641 | 0.994053 | 0.956503 | 0.970127 | 0.990129 | 0.574716 | 0.632921 | -0.710876 | -0.587451 | 0.813179 | 0.824801 | 0.605392 | 0.624436 | 0.586586 |
| WonGames | 0.971954 | 0.997240 | 0.967641 | 1.000000 | 0.939844 | 0.852785 | 0.999318 | 0.926234 | 0.752204 | 0.797350 | -0.648628 | -0.543854 | 0.884278 | 0.884390 | 0.778181 | 0.791682 | 0.762040 |
| DrawnGames | 0.989387 | 0.958452 | 0.994053 | 0.939844 | 1.000000 | 0.973156 | 0.942668 | 0.992579 | 0.503472 | 0.566269 | -0.719845 | -0.577616 | 0.774416 | 0.797102 | 0.535737 | 0.558014 | 0.516602 |
| LostGames | 0.949863 | 0.880040 | 0.956503 | 0.852785 | 0.973156 | 1.000000 | 0.859169 | 0.985275 | 0.325462 | 0.393211 | -0.725149 | -0.600513 | 0.665366 | 0.684314 | 0.358720 | 0.381463 | 0.338187 |
| BasketScored | 0.975059 | 0.996656 | 0.970127 | 0.999318 | 0.942668 | 0.859169 | 1.000000 | 0.931548 | 0.743144 | 0.790532 | -0.649455 | -0.553453 | 0.879124 | 0.880281 | 0.770080 | 0.782863 | 0.752690 |
| BasketGiven | 0.987828 | 0.944263 | 0.990129 | 0.926234 | 0.992579 | 0.985275 | 0.931548 | 1.000000 | 0.471824 | 0.533814 | -0.726610 | -0.615715 | 0.757279 | 0.767157 | 0.503495 | 0.524271 | 0.484105 |
| TournamentChampion | 0.589072 | 0.714280 | 0.574716 | 0.752204 | 0.503472 | 0.325462 | 0.743144 | 0.471824 | 1.000000 | 0.971552 | -0.305397 | -0.284878 | 0.765351 | 0.728718 | 0.994134 | 0.984889 | 0.999453 |
| Runner-up | 0.645876 | 0.764278 | 0.632921 | 0.797350 | 0.566269 | 0.393211 | 0.790532 | 0.533814 | 0.971552 | 1.000000 | -0.360408 | -0.315881 | 0.801247 | 0.772217 | 0.991466 | 0.993496 | 0.973634 |
| HighestPositionHeld | -0.708071 | -0.669775 | -0.710876 | -0.648628 | -0.719845 | -0.725149 | -0.649455 | -0.726610 | -0.305397 | -0.360408 | 1.000000 | 0.588914 | -0.737288 | -0.715211 | -0.332695 | -0.361897 | -0.315443 |
| TeamLaunchStartYear | -0.603151 | -0.543432 | -0.587451 | -0.543854 | -0.577616 | -0.600513 | -0.553453 | -0.615715 | -0.284878 | -0.315881 | 0.588914 | 1.000000 | -0.491259 | -0.394118 | -0.301066 | -0.306495 | -0.292787 |
| Percentage_won | 0.819559 | 0.877385 | 0.813179 | 0.884278 | 0.774416 | 0.665366 | 0.879124 | 0.757279 | 0.765351 | 0.801247 | -0.737288 | -0.491259 | 1.000000 | 0.949620 | 0.787199 | 0.806392 | 0.772811 |
| Percentage_BasketScored | 0.827096 | 0.883248 | 0.824801 | 0.884390 | 0.797102 | 0.684314 | 0.880281 | 0.767157 | 0.728718 | 0.772217 | -0.715211 | -0.394118 | 0.949620 | 1.000000 | 0.753776 | 0.774349 | 0.736684 |
| Finalist | 0.619210 | 0.742202 | 0.605392 | 0.778181 | 0.535737 | 0.358720 | 0.770080 | 0.503495 | 0.994134 | 0.991466 | -0.332695 | -0.301066 | 0.787199 | 0.753776 | 1.000000 | 0.995838 | 0.994784 |
| percentage_Finalist | 0.636885 | 0.758919 | 0.624436 | 0.791682 | 0.558014 | 0.381463 | 0.782863 | 0.524271 | 0.984889 | 0.993496 | -0.361897 | -0.306495 | 0.806392 | 0.774349 | 0.995838 | 1.000000 | 0.987447 |
| percentage_TournamentChampion | 0.600658 | 0.725019 | 0.586586 | 0.762040 | 0.516602 | 0.338187 | 0.752690 | 0.484105 | 0.999453 | 0.973634 | -0.315443 | -0.292787 | 0.772811 | 0.736684 | 0.994784 | 0.987447 | 1.000000 |
sns.pairplot(Data, kind="reg")
<seaborn.axisgrid.PairGrid at 0x7f92bf519370>
It can be observed that most of the variables are right-skewed, while Percentage_BasketScored and Percentage_won are approximately normal. The variables are correlated and largely linear in nature, with outliers also present.
fig,ax = plt.subplots(figsize=(10, 10))
sns.heatmap(Data.corr(), ax=ax, annot=True, linewidths=0.05, fmt= '.2f',cmap="magma")
plt.show()
Data.skew()
Tournament 1.217038 Score 1.593109 PlayedGames 1.141978 WonGames 1.805728 DrawnGames 1.004159 LostGames 0.897130 BasketScored 1.777436 BasketGiven 0.975859 TournamentChampion 4.777021 Runner-up 4.360643 HighestPositionHeld 0.817976 TeamLaunchStartYear 0.672956 Percentage_won 1.440046 Percentage_BasketScored 1.189488 Finalist 4.562439 percentage_Finalist 4.374451 percentage_TournamentChampion 4.702287 dtype: float64
As shown, the most frequent values are low and the tail extends towards higher values.
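A common remedy for this kind of right skew, before further modelling, is a log transform. This is a sketch of that idea on a synthetic sample (an assumption; the notebook itself does not transform the data — on the real data it would be `np.log1p(Data['Score'])`):

```python
import numpy as np
import pandas as pd

# Right-skewed synthetic sample: mostly small values with a long high tail.
raw = pd.Series([1, 2, 3, 5, 8, 20, 400, 4000], dtype=float)

# log1p = log(1 + x), which handles zero values safely.
logged = np.log1p(raw)

# Skewness drops substantially after the transform.
print(raw.skew(), logged.skew())
```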
plt.hist(Data['Score'], bins=50)
(array([14., 7., 4., 4., 4., 2., 3., 2., 0., 0., 0., 2., 0.,
2., 1., 2., 2., 0., 0., 0., 2., 1., 0., 0., 1., 0.,
0., 0., 0., 1., 0., 1., 1., 0., 0., 0., 0., 0., 2.,
1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.]),
array([ 0. , 87.7, 175.4, 263.1, 350.8, 438.5, 526.2, 613.9,
701.6, 789.3, 877. , 964.7, 1052.4, 1140.1, 1227.8, 1315.5,
1403.2, 1490.9, 1578.6, 1666.3, 1754. , 1841.7, 1929.4, 2017.1,
2104.8, 2192.5, 2280.2, 2367.9, 2455.6, 2543.3, 2631. , 2718.7,
2806.4, 2894.1, 2981.8, 3069.5, 3157.2, 3244.9, 3332.6, 3420.3,
3508. , 3595.7, 3683.4, 3771.1, 3858.8, 3946.5, 4034.2, 4121.9,
4209.6, 4297.3, 4385. ]),
<BarContainer object of 50 artists>)
sns.boxplot(y=Data["Score"])
<AxesSubplot:ylabel='Score'>
As shown in the histogram, team frequency is spread across different score ranges, with some teams scoring above 1K, 2K, 3K and even 4K. Since score is a fairly good indicator of performance (as shown in the correlation matrix), we can focus on teams with higher scores.
plt.hist(Data['Percentage_won'], bins=50)
(array([1., 0., 1., 0., 0., 2., 0., 4., 2., 3., 0., 1., 3., 2., 9., 2., 3.,
3., 6., 4., 2., 2., 1., 1., 1., 0., 0., 0., 2., 0., 0., 1., 1., 0.,
0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1.]),
array([16.66666667, 17.52594738, 18.3852281 , 19.24450881, 20.10378952,
20.96307024, 21.82235095, 22.68163167, 23.54091238, 24.4001931 ,
25.25947381, 26.11875453, 26.97803524, 27.83731595, 28.69659667,
29.55587738, 30.4151581 , 31.27443881, 32.13371953, 32.99300024,
33.85228096, 34.71156167, 35.57084238, 36.4301231 , 37.28940381,
38.14868453, 39.00796524, 39.86724596, 40.72652667, 41.58580739,
42.4450881 , 43.30436881, 44.16364953, 45.02293024, 45.88221096,
46.74149167, 47.60077239, 48.4600531 , 49.31933382, 50.17861453,
51.03789524, 51.89717596, 52.75645667, 53.61573739, 54.4750181 ,
55.33429882, 56.19357953, 57.05286025, 57.91214096, 58.77142168,
59.63070239]),
<BarContainer object of 50 artists>)
sns.boxplot(y=Data["Percentage_won"])
<AxesSubplot:ylabel='Percentage_won'>
As shown in the histogram, there are teams with higher win percentages, but the majority sit around ~30%. We can focus on teams with a higher win percentage, as win percentage is a very good indicator of performance.
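Shortlisting the teams with the highest win percentage is a one-liner; a minimal sketch on a hypothetical mini-frame (on the real data: `Data.sort_values('Percentage_won', ascending=False)`):

```python
import pandas as pd

# Hypothetical mini-frame standing in for Data.
df = pd.DataFrame({
    "Team": ["Team1", "Team2", "Team3", "Team4"],
    "Percentage_won": [59.6, 33.5, 27.6, 45.0],
})

# Rank by win percentage and keep the top teams.
top = df.sort_values("Percentage_won", ascending=False).head(3)
print(top["Team"].tolist())
```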
plt.hist(Data['Percentage_BasketScored'], bins=50)
(array([1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 2., 3., 5., 0., 3., 5.,
3., 4., 5., 1., 2., 5., 5., 3., 0., 0., 3., 0., 1., 0., 0., 1., 1.,
0., 0., 1., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 2.]),
array([27.77777778, 28.53129699, 29.28481621, 30.03833543, 30.79185465,
31.54537386, 32.29889308, 33.0524123 , 33.80593151, 34.55945073,
35.31296995, 36.06648917, 36.82000838, 37.5735276 , 38.32704682,
39.08056603, 39.83408525, 40.58760447, 41.34112368, 42.0946429 ,
42.84816212, 43.60168134, 44.35520055, 45.10871977, 45.86223899,
46.6157582 , 47.36927742, 48.12279664, 48.87631585, 49.62983507,
50.38335429, 51.13687351, 51.89039272, 52.64391194, 53.39743116,
54.15095037, 54.90446959, 55.65798881, 56.41150802, 57.16502724,
57.91854646, 58.67206568, 59.42558489, 60.17910411, 60.93262333,
61.68614254, 62.43966176, 63.19318098, 63.94670019, 64.70021941,
65.45373863]),
<BarContainer object of 50 artists>)
sns.boxplot(y=Data["Percentage_BasketScored"])
<AxesSubplot:ylabel='Percentage_BasketScored'>
Percentage_BasketScored is the ratio of baskets scored to total baskets (scored + given). It indicates that the majority of teams fall around 35-50%. We can focus on teams with a higher percentage of baskets scored.
plt.hist(Data['percentage_Finalist'], bins=50)
(array([47., 4., 1., 0., 1., 2., 0., 0., 0., 0., 1., 1., 0.,
1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.]),
array([ 0. , 1.30232558, 2.60465116, 3.90697674, 5.20930233,
6.51162791, 7.81395349, 9.11627907, 10.41860465, 11.72093023,
13.02325581, 14.3255814 , 15.62790698, 16.93023256, 18.23255814,
19.53488372, 20.8372093 , 22.13953488, 23.44186047, 24.74418605,
26.04651163, 27.34883721, 28.65116279, 29.95348837, 31.25581395,
32.55813953, 33.86046512, 35.1627907 , 36.46511628, 37.76744186,
39.06976744, 40.37209302, 41.6744186 , 42.97674419, 44.27906977,
45.58139535, 46.88372093, 48.18604651, 49.48837209, 50.79069767,
52.09302326, 53.39534884, 54.69767442, 56. , 57.30232558,
58.60465116, 59.90697674, 61.20930233, 62.51162791, 63.81395349,
65.11627907]),
<BarContainer object of 50 artists>)
sns.boxplot(y=Data["percentage_Finalist"])
<AxesSubplot:ylabel='percentage_Finalist'>
plt.hist(Data['Finalist'], bins=50)
(array([53., 0., 0., 0., 2., 1., 0., 0., 0., 0., 1., 0., 0.,
1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.]),
array([ 0. , 1.12, 2.24, 3.36, 4.48, 5.6 , 6.72, 7.84, 8.96,
10.08, 11.2 , 12.32, 13.44, 14.56, 15.68, 16.8 , 17.92, 19.04,
20.16, 21.28, 22.4 , 23.52, 24.64, 25.76, 26.88, 28. , 29.12,
30.24, 31.36, 32.48, 33.6 , 34.72, 35.84, 36.96, 38.08, 39.2 ,
40.32, 41.44, 42.56, 43.68, 44.8 , 45.92, 47.04, 48.16, 49.28,
50.4 , 51.52, 52.64, 53.76, 54.88, 56. ]),
<BarContainer object of 50 artists>)
plt.hist(Data['TournamentChampion'], bins=50)
(array([52., 3., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 1.,
0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]),
array([ 0. , 0.66, 1.32, 1.98, 2.64, 3.3 , 3.96, 4.62, 5.28,
5.94, 6.6 , 7.26, 7.92, 8.58, 9.24, 9.9 , 10.56, 11.22,
11.88, 12.54, 13.2 , 13.86, 14.52, 15.18, 15.84, 16.5 , 17.16,
17.82, 18.48, 19.14, 19.8 , 20.46, 21.12, 21.78, 22.44, 23.1 ,
23.76, 24.42, 25.08, 25.74, 26.4 , 27.06, 27.72, 28.38, 29.04,
29.7 , 30.36, 31.02, 31.68, 32.34, 33. ]),
<BarContainer object of 50 artists>)
Percentage finalist, finalist count and tournament championships are good indicators of team performance. But since most teams have never been finalists, these metrics filter out most of the teams. Being a tournament finalist is a good indicator, yet in the given data only a few old teams reach the finals.
plt.hist(Data['HighestPositionHeld'], bins=50)
(array([9., 0., 5., 0., 0., 4., 0., 6., 0., 0., 4., 0., 0., 5., 0., 5., 0.,
0., 4., 0., 0., 2., 0., 4., 0., 0., 1., 0., 2., 0., 0., 0., 0., 0.,
1., 0., 1., 0., 0., 3., 0., 0., 3., 0., 0., 0., 0., 1., 0., 1.]),
array([ 1. , 1.38, 1.76, 2.14, 2.52, 2.9 , 3.28, 3.66, 4.04,
4.42, 4.8 , 5.18, 5.56, 5.94, 6.32, 6.7 , 7.08, 7.46,
7.84, 8.22, 8.6 , 8.98, 9.36, 9.74, 10.12, 10.5 , 10.88,
11.26, 11.64, 12.02, 12.4 , 12.78, 13.16, 13.54, 13.92, 14.3 ,
14.68, 15.06, 15.44, 15.82, 16.2 , 16.58, 16.96, 17.34, 17.72,
18.1 , 18.48, 18.86, 19.24, 19.62, 20. ]),
<BarContainer object of 50 artists>)
sns.boxplot(y=Data["HighestPositionHeld"])
<AxesSubplot:ylabel='HighestPositionHeld'>
Here we should focus on the teams with the highest rank held. We see a high frequency of teams with good performances; the company can invest in these teams as they have the talent and capability to reach the top rankings.
With these graphical representations we get a fair idea of the performance of the different teams, and we can make investment decisions for teams within each category.
sns.pairplot(Data, kind='reg')
<seaborn.axisgrid.PairGrid at 0x7f92c2cfdcd0>
fig,ax = plt.subplots(figsize=(50,20))
sns.lineplot(x=Data['Team'], y=Data['Score'], ax=ax)
<AxesSubplot:xlabel='Team', ylabel='Score'>
We can focus on teams with higher scores, like Team1, Team2 and so on, as they have fairly high scores.
fig,ax = plt.subplots(figsize=(20,20))
sns.heatmap(Data.corr(), annot=True, ax=ax)
<AxesSubplot:>
fig,ax = plt.subplots(figsize=(50,20))
sns.lineplot(x=Data['Team'], y=Data['Percentage_won'], ax=ax)
<AxesSubplot:xlabel='Team', ylabel='Percentage_won'>
There are also a few spikes indicating teams with a good win percentage. We can focus on teams with a higher win percentage, like Team1, Team2 and so on.
Data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 61 entries, 0 to 60 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Team 61 non-null object 1 Tournament 61 non-null int64 2 Score 61 non-null int64 3 PlayedGames 61 non-null int64 4 WonGames 61 non-null int64 5 DrawnGames 61 non-null int64 6 LostGames 61 non-null int64 7 BasketScored 61 non-null int64 8 BasketGiven 61 non-null int64 9 TournamentChampion 61 non-null int64 10 Runner-up 61 non-null int64 11 TeamLaunch 61 non-null object 12 HighestPositionHeld 61 non-null int64 13 TeamLaunchStartYear 61 non-null int64 14 Percentage_won 60 non-null float64 15 Percentage_BasketScored 60 non-null float64 16 Finalist 61 non-null int64 17 percentage_Finalist 61 non-null float64 18 percentage_TournamentChampion 61 non-null float64 dtypes: float64(4), int64(13), object(2) memory usage: 9.2+ KB
fig,ax = plt.subplots(figsize=(50,20))
sns.lineplot(x=Data['Team'], y=Data['percentage_TournamentChampion'], ax=ax)
<AxesSubplot:xlabel='Team', ylabel='percentage_TournamentChampion'>
The percentage of tournament championships shows only a few teams with high values, which indicates that a few teams dominate the tournaments.
fig,ax = plt.subplots(figsize=(50,20))
sns.lineplot(x=Data['Team'], y=Data['Percentage_BasketScored'], ax=ax)
<AxesSubplot:xlabel='Team', ylabel='Percentage_BasketScored'>
There are also a few spikes indicating teams with a good percentage of baskets scored. We can focus on teams with a higher basket-scored percentage, like Team1, Team2 and so on.
Data[['TeamLaunchStartYear','Score']].groupby(['TeamLaunchStartYear']).sum().plot(figsize=(15,5))
plt.show()
The graph shows that the earlier a team launched, the higher its score tends to be. We can focus on teams with earlier launch years.
fig,ax = plt.subplots(figsize=(50,20))
sns.scatterplot(x=Data['TeamLaunchStartYear'], y=Data['Score'], hue=Data['Team'], ax=ax)
<AxesSubplot:xlabel='TeamLaunchStartYear', ylabel='Score'>
We can focus on teams launched in different years but with higher scores.
fig,ax = plt.subplots(figsize=(50,20))
sns.lineplot(x=Data['Team'], y=Data['Finalist'], ax=ax)
<AxesSubplot:xlabel='Team', ylabel='Finalist'>
The finalists are basically restricted to a few teams.
fig,ax = plt.subplots(figsize=(50,20))
sns.lineplot(x=Data['Team'], y=Data['percentage_Finalist'], ax=ax)
<AxesSubplot:xlabel='Team', ylabel='percentage_Finalist'>
We can focus on teams with a good finalist percentage, like Team1, Team2 and so on.
fig,ax = plt.subplots(figsize=(50,20))
sns.lineplot(x=Data['Team'], y=Data['TournamentChampion'], ax=ax)
<AxesSubplot:xlabel='Team', ylabel='TournamentChampion'>
We can focus on teams with more championships, like Team1, Team2 and so on.
fig,ax = plt.subplots(figsize=(50,20))
sns.lineplot(x=Data['Team'], y=Data['HighestPositionHeld'], ax=ax)
<AxesSubplot:xlabel='Team', ylabel='HighestPositionHeld'>
Here we can focus on teams like 1, 2, 3, 4 and so on with a fairly good HighestPositionHeld.
import pandas_profiling
data_profile = pd.read_csv('DS - Part2 - Basketball.csv')
pandas_profiling.ProfileReport(data_profile)
Data analysis:
1. There is a lot of missing data and the dataset is small; once rows with missing data are removed, only 8-10 rows remain.
2. The high correlation across variables means less variability, i.e. less richness in the data.
3. As shown in the report, most variables are fairly uniform.
4. The high cardinality in the data makes it difficult to categorise or group.
5. The data is highly skewed.
Summary of insights:
1. Score is highly correlated with most of the other variables, as shown in the correlation graph.
2. As shown in the heat map, all variables are directly related except HighestPositionHeld and TeamLaunchStartYear, which are inversely related, as expected.
3. However, Score is also correlated with losses and other negative indicators.
Hence we can conclude that teams like Team 1, 2, 3, 4 and so on are good candidates, since they have strong records over the years: high scores and high counts and percentages of wins, finalist appearances, baskets scored and tournament championships.
No null data present.
No duplicate data found.
Since TournamentChampion (48 '-') and Runner-up (52 '-') have '-' for most entries, we replace '-' with 0, meaning the team has never won a tournament or been a runner-up. This is the best option, as replacing with the average or removing the rows would not be appropriate.
Replacing the blank '-' entries with the value 0, since the data is missing for them (assuming those teams did not play those matches), thereby making the data type uniform.
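The replacement described above can be sketched as follows (column names are from the dataset; the mini-frame values are illustrative):

```python
import pandas as pd

# Mini-frame mimicking the raw columns; '-' marks teams that never won.
df = pd.DataFrame({"TournamentChampion": ["33", "-", "2"],
                   "Runner-up": ["-", "-", "5"]})

for col in ["TournamentChampion", "Runner-up"]:
    df[col] = df[col].replace("-", 0)   # '-' means no titles, so use 0
    df[col] = pd.to_numeric(df[col])    # unify the dtype as numeric
```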
Since the TeamLaunch column is not uniform and contains both years and durations, a new column is created capturing the starting year by parsing the values in the column.
Converted the object data to numeric so that numerical functions can be applied.
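Parsing the starting year out of the mixed TeamLaunch strings can be sketched as below (the exact raw formats are assumed; here a plain year and a duration are shown):

```python
import pandas as pd

# TeamLaunch mixes plain years and durations such as "1933-1950";
# keep only the first four-digit year, then convert to numeric.
launch = pd.Series(["1929", "1933-1950", "1935"])
start_year = pd.to_numeric(launch.str.extract(r"(\d{4})")[0])
```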
Inserted the columns below for better analysis:
1. Percentage_won = WonGames / PlayedGames * 100
2. Percentage_BasketScored = BasketScored / (BasketScored + BasketGiven) * 100
3. Finalist = TournamentChampion + Runner-up
4. percentage_Finalist = (TournamentChampion + Runner-up) / Tournament * 100
5. percentage_TournamentChampion = TournamentChampion / Tournament * 100
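The five derived columns listed above can be created directly from the formulas; a sketch on a mini-frame with values taken from the quartile output (on the real data, replace `df` with `Data`):

```python
import pandas as pd

df = pd.DataFrame({"WonGames": [34], "PlayedGames": [114],
                   "BasketScored": [153], "BasketGiven": [221],
                   "TournamentChampion": [0], "Runner-up": [0],
                   "Tournament": [4]})

df["Percentage_won"] = df["WonGames"] / df["PlayedGames"] * 100
df["Percentage_BasketScored"] = (df["BasketScored"]
                                 / (df["BasketScored"] + df["BasketGiven"]) * 100)
df["Finalist"] = df["TournamentChampion"] + df["Runner-up"]
df["percentage_Finalist"] = df["Finalist"] / df["Tournament"] * 100
df["percentage_TournamentChampion"] = df["TournamentChampion"] / df["Tournament"] * 100
```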
Analysed the basic structure of the data.
We have completed the data filtering and processing:
We have removed duplicate and null data.
Replaced missing data.
Processed the duration into a start year.
Changed the data types.
Introduced new columns for better analysis.
We are not sure about the missing data marked '-', which could perhaps have been imputed with a numerical value. Columns like TournamentChampion and Runner-up have a lot of missing data, which makes them difficult to analyse.
• DOMAIN: Startup ecosystem
• CONTEXT: Company X is an EU online publisher focusing on the startup industry. The company specifically reports on business related to technology news, analysis of emerging trends and profiling of new tech businesses and products. Their event, Startup Battlefield, is the world's pre-eminent startup competition. Startup Battlefield features 15-30 top early-stage startups pitching to top judges in front of a vast live audience, present in person and online.
• DATA DESCRIPTION: CompanyX_EU.csv - Each row in the dataset is a Start-up company and the columns describe the company. ATTRIBUTE INFORMATION:
1. Startup: Name of the company
2. Product: Actual product
3. Funding: Funds raised by the company in USD
4. Event: The event the company participated in
5. Result: Described by Contestant, Finalist, Audience choice, Winner or Runner up
6. OperatingState: Current status of the company: Operating, Closed, Acquired or IPO
*Dataset has been downloaded from the internet. All the credit for the dataset goes to the original creator of the data.
• PROJECT OBJECTIVE: Analyse the data of the various companies from the given dataset and perform the tasks that are specified in the below steps. Draw insights from the various attributes that are present in the dataset, plot distributions, state hypotheses and draw conclusions from the dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True) # adds a nice background to the graphs
%matplotlib inline
data= pd.read_csv('DS - Part3 - CompanyX_EU.csv')
data.head() # view the first 5 rows of the data
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| 0 | 2600Hz | 2600hz.com | NaN | Disrupt SF 2013 | Contestant | Operating |
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed |
| 2 | 3DPrinterOS | 3dprinteros.com | NaN | Disrupt SF 2016 | Contestant | Operating |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating |
| 4 | 42 Technologies | 42technologies.com | NaN | Disrupt NYC 2013 | Contestant | Operating |
From the table, it is observed that there are 6 columns with the below characteristics:
1. Startup : object type, unique for each row
2. Product : object type, unique for each row
3. Funding : a continuous value; the column needs processing because it is stored in a non-uniform string format
4. Event : categorical data listing the different event names
5. Result : categorical data
6. OperatingState : categorical data
data.dtypes
Startup object Product object Funding object Event object Result object OperatingState object dtype: object
All Data type is currently object
We need the column 'Funding' to be of numerical data type
data.shape
(662, 6)
Data has 662 rows and 6 columns
data.describe()
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| count | 662 | 656 | 448 | 662 | 662 | 662 |
| unique | 662 | 656 | 240 | 26 | 5 | 4 |
| top | DotSpots | artveoli.com | $1M | TC50 2008 | Contestant | Operating |
| freq | 1 | 1 | 17 | 52 | 488 | 465 |
data.Result.value_counts()
Contestant 488 Finalist 84 Audience choice 41 Winner 26 Runner up 23 Name: Result, dtype: int64
data.OperatingState.value_counts()
Operating 465 Closed 106 Acquired 86 Ipo 5 Name: OperatingState, dtype: int64
data.Funding.value_counts()
$1M 17
$2M 12
$3M 9
$1.2M 9
$1.3M 9
..
$12.8M 1
$64M 1
$35.4M 1
$9.7M 1
$590K 1
Name: Funding, Length: 240, dtype: int64
data.isnull().values.any()
True
data.isnull().sum()
Startup 0 Product 6 Funding 214 Event 0 Result 0 OperatingState 0 dtype: int64
There are missing values in Product (6) and Funding (214)
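As a quick way to judge how serious the missingness is, the per-column missing percentage can be computed directly; the small frame below is a hypothetical stand-in for the notebook's `data` DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data` DataFrame
demo = pd.DataFrame({
    "Product": ["3dlt.com", None, "zumper.com"],
    "Funding": ["$630K", None, None],
})

# Fraction of nulls per column, expressed as a percentage and sorted worst-first
missing_pct = demo.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct)
```

On the real dataset this reports Funding at roughly 32% missing (214 of 662 rows), which is why the treatment of that column matters so much.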
dupes = data.duplicated()
sum(dupes)
0
data.dropna(inplace=True)
data.isnull().sum()
Startup 0 Product 0 Funding 0 Event 0 Result 0 OperatingState 0 dtype: int64
data.isnull().values.any()
False
data.Funding.value_counts()
$1M 17
$2M 12
$1.2M 9
$1.3M 9
$3M 9
..
$36.5M 1
$103M 1
$35.4M 1
$5K 1
$684.4K 1
Name: Funding, Length: 239, dtype: int64
Strip the '$' sign from the Funding values
data.Funding = data.Funding.apply(lambda x:x.replace('$',''))
data
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| 1 | 3DLT | 3dlt.com | 630K | Disrupt NYC 2013 | Contestant | Closed |
| 3 | 3Dprintler | 3dprintler.com | 1M | Disrupt NY 2016 | Audience choice | Operating |
| 5 | 5to1 | 5to1.com | 19.3M | TC50 2009 | Contestant | Acquired |
| 6 | 8 Securities | 8securities.com | 29M | Disrupt Beijing 2011 | Finalist | Operating |
| 10 | AdhereTech | adheretech.com | 1.8M | Hardware Battlefield 2014 | Contestant | Operating |
| ... | ... | ... | ... | ... | ... | ... |
| 657 | Zivity | zivity.com | 8M | TC40 2007 | Contestant | Operating |
| 658 | Zmorph | zmorph3d.com | 1M | - | Audience choice | Operating |
| 659 | Zocdoc | zocdoc.com | 223M | TC40 2007 | Contestant | Operating |
| 660 | Zula | zulaapp.com | 3.4M | Disrupt SF 2013 | Audience choice | Operating |
| 661 | Zumper | zumper.com | 31.5M | Disrupt SF 2012 | Finalist | Operating |
446 rows × 6 columns
Here we first strip the K/M/B suffix and convert the remaining number to float, then extract the suffix and replace K with 10^3, M with 10^6 and B with 10^9, and finally multiply the two. The result is a numeric (float) Funding column.
data.Funding = (data.Funding.replace(r'[KMB]+$', '', regex=True).astype(float) * \
data.Funding.str.extract(r'[\d\.]+([KMB]+)', expand=False).
fillna(1).replace(['K','M', 'B'], [10**3, 10**6, 10**9]).astype(int))
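The conversion above can be sanity-checked in isolation on a few hypothetical funding strings (with the '$' already stripped); the parsing logic is the same as in the cell above.

```python
import pandas as pd

# Hypothetical funding strings with the '$' already stripped
funding = pd.Series(["630K", "1M", "1.7B", "5K", "2.5"])

# Same logic as above: drop the suffix to get the number, then multiply by the
# factor the suffix stands for (K=10^3, M=10^6, B=10^9; no suffix => 1)
parsed = (funding.replace(r'[KMB]+$', '', regex=True).astype(float) *
          funding.str.extract(r'[\d\.]+([KMB]+)', expand=False)
                 .fillna(1).replace(['K', 'M', 'B'], [10**3, 10**6, 10**9]).astype(int))
print(parsed.tolist())
```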
data
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| 1 | 3DLT | 3dlt.com | 630000.0 | Disrupt NYC 2013 | Contestant | Closed |
| 3 | 3Dprintler | 3dprintler.com | 1000000.0 | Disrupt NY 2016 | Audience choice | Operating |
| 5 | 5to1 | 5to1.com | 19300000.0 | TC50 2009 | Contestant | Acquired |
| 6 | 8 Securities | 8securities.com | 29000000.0 | Disrupt Beijing 2011 | Finalist | Operating |
| 10 | AdhereTech | adheretech.com | 1800000.0 | Hardware Battlefield 2014 | Contestant | Operating |
| ... | ... | ... | ... | ... | ... | ... |
| 657 | Zivity | zivity.com | 8000000.0 | TC40 2007 | Contestant | Operating |
| 658 | Zmorph | zmorph3d.com | 1000000.0 | - | Audience choice | Operating |
| 659 | Zocdoc | zocdoc.com | 223000000.0 | TC40 2007 | Contestant | Operating |
| 660 | Zula | zulaapp.com | 3400000.0 | Disrupt SF 2013 | Audience choice | Operating |
| 661 | Zumper | zumper.com | 31500000.0 | Disrupt SF 2012 | Finalist | Operating |
446 rows × 6 columns
data["Funding(in Millions)"]=data.Funding/10**6
data
| Startup | Product | Funding | Event | Result | OperatingState | Funding(in Millions) | |
|---|---|---|---|---|---|---|---|
| 1 | 3DLT | 3dlt.com | 630000.0 | Disrupt NYC 2013 | Contestant | Closed | 0.63 |
| 3 | 3Dprintler | 3dprintler.com | 1000000.0 | Disrupt NY 2016 | Audience choice | Operating | 1.00 |
| 5 | 5to1 | 5to1.com | 19300000.0 | TC50 2009 | Contestant | Acquired | 19.30 |
| 6 | 8 Securities | 8securities.com | 29000000.0 | Disrupt Beijing 2011 | Finalist | Operating | 29.00 |
| 10 | AdhereTech | adheretech.com | 1800000.0 | Hardware Battlefield 2014 | Contestant | Operating | 1.80 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 657 | Zivity | zivity.com | 8000000.0 | TC40 2007 | Contestant | Operating | 8.00 |
| 658 | Zmorph | zmorph3d.com | 1000000.0 | - | Audience choice | Operating | 1.00 |
| 659 | Zocdoc | zocdoc.com | 223000000.0 | TC40 2007 | Contestant | Operating | 223.00 |
| 660 | Zula | zulaapp.com | 3400000.0 | Disrupt SF 2013 | Audience choice | Operating | 3.40 |
| 661 | Zumper | zumper.com | 31500000.0 | Disrupt SF 2012 | Finalist | Operating | 31.50 |
446 rows × 7 columns
data.describe()
| Funding | Funding(in Millions) | |
|---|---|---|
| count | 4.460000e+02 | 446.000000 |
| mean | 1.724149e+07 | 17.241489 |
| std | 9.048371e+07 | 90.483710 |
| min | 5.000000e+03 | 0.005000 |
| 25% | 7.452500e+05 | 0.745250 |
| 50% | 2.200000e+06 | 2.200000 |
| 75% | 9.475000e+06 | 9.475000 |
| max | 1.700000e+09 | 1700.000000 |
%matplotlib inline
from matplotlib import pyplot as plt
plot = plt.boxplot(data["Funding(in Millions)"])
plt.title('Boxplot of the funds')
plt.ylabel("Funding(in Millions)")
plt.show()
There are significant outliers in the data, which squeeze the box so much that it is barely visible. Re-plotting the box plot without the outliers gives better visibility of the quartiles.
sns.set_theme(style="whitegrid")
sns.set(rc = {'figure.figsize':(15,8)})
sns.boxplot(x=(data["Funding"]/(10**6)),showfliers = False)
<AxesSubplot:xlabel='Funding'>
lower_fence = plot['caps'][0].get_data()[1][1]
lower_fence
0.005
Lower fence in box plot : 0.005
[item.get_ydata() for item in plot['whiskers']]
[array([0.74525, 0.005 ]), array([ 9.475, 22. ])]
# Other details
median_Funding = data.Funding.median()
Q1_Funding = data.Funding.quantile(q=0.25)
Q3_Funding = data.Funding.quantile(q=0.75)
min_Funding = data.Funding.min()
IQR = Q3_Funding-Q1_Funding
print(f"Funding : \nMedian : {median_Funding} \nQ1 :{Q1_Funding}\nQ3 : {Q3_Funding} \nMin : {min_Funding} \nIQR : {IQR} ")
print(f"Upper whisker : {Q3_Funding + 1.5* IQR}")
print(f"Lower whisker : {(Q1_Funding - 1.5* IQR) if (Q1_Funding - 1.5* IQR)>0 else 0}")
Funding : Median : 2200000.0 Q1 :745250.0 Q3 : 9475000.0 Min : 5000.0 IQR : 8729750.0 Upper whisker : 22569625.0 Lower whisker : 0
data_Funding_outliers =data.Funding[data.Funding > (Q3_Funding + 1.5 * IQR)]
len(data_Funding_outliers)
60
There are 60 outliers greater than the upper fence
df_out = data.copy()
# Keep only the rows with Funding less than the upper whisker (Q3 + 1.5 * IQR)
df_out = df_out[df_out.Funding < (Q3_Funding + 1.5 * IQR)]
df_out.shape
df_out
| Startup | Product | Funding | Event | Result | OperatingState | Funding(in Millions) | |
|---|---|---|---|---|---|---|---|
| 1 | 3DLT | 3dlt.com | 630000.0 | Disrupt NYC 2013 | Contestant | Closed | 0.63 |
| 3 | 3Dprintler | 3dprintler.com | 1000000.0 | Disrupt NY 2016 | Audience choice | Operating | 1.00 |
| 5 | 5to1 | 5to1.com | 19300000.0 | TC50 2009 | Contestant | Acquired | 19.30 |
| 10 | AdhereTech | adheretech.com | 1800000.0 | Hardware Battlefield 2014 | Contestant | Operating | 1.80 |
| 11 | AdRocket | adrocket.com | 1000000.0 | TC50 2008 | Contestant | Closed | 1.00 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 645 | Yap | yapme.com | 10000000.0 | TC40 2007 | Contestant | Closed | 10.00 |
| 646 | YayPay Inc | yaypay.com | 900000.0 | Disrupt London 2015 | Contestant | Operating | 0.90 |
| 657 | Zivity | zivity.com | 8000000.0 | TC40 2007 | Contestant | Operating | 8.00 |
| 658 | Zmorph | zmorph3d.com | 1000000.0 | - | Audience choice | Operating | 1.00 |
| 660 | Zula | zulaapp.com | 3400000.0 | Disrupt SF 2013 | Audience choice | Operating | 3.40 |
386 rows × 7 columns
sns.set_theme(style="whitegrid")
sns.set(rc = {'figure.figsize':(15,8)})
sns.boxplot(x=(df_out["Funding"]/(10**6)))
<AxesSubplot:xlabel='Funding'>
df_out.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 386 entries, 1 to 660 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Startup 386 non-null object 1 Product 386 non-null object 2 Funding 386 non-null float64 3 Event 386 non-null object 4 Result 386 non-null object 5 OperatingState 386 non-null object 6 Funding(in Millions) 386 non-null float64 dtypes: float64(2), object(5) memory usage: 24.1+ KB
df_out.OperatingState.value_counts()
Operating 275 Closed 56 Acquired 55 Name: OperatingState, dtype: int64
275 companies are operating, 56 are closed and 55 are acquired
plt.hist(df_out.OperatingState)
(array([ 56., 0., 0., 0., 0., 275., 0., 0., 0., 55.]), array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ]), <BarContainer object of 10 artists>)
sns.displot(x=(df_out["Funding(in Millions)"]),bins=30)
plt.title('Distribution plot for Funds in million.')
plt.show()
plt.hist(df_out["Funding"]/(10**6), bins=50)
(array([77., 50., 49., 20., 28., 13., 15., 14., 6., 10., 5., 6., 5.,
5., 6., 5., 8., 2., 6., 2., 1., 2., 8., 2., 3., 1.,
2., 6., 3., 3., 0., 0., 3., 0., 1., 2., 2., 3., 0.,
1., 2., 2., 1., 2., 0., 0., 2., 1., 0., 1.]),
array([5.00000e-03, 4.44900e-01, 8.84800e-01, 1.32470e+00, 1.76460e+00,
2.20450e+00, 2.64440e+00, 3.08430e+00, 3.52420e+00, 3.96410e+00,
4.40400e+00, 4.84390e+00, 5.28380e+00, 5.72370e+00, 6.16360e+00,
6.60350e+00, 7.04340e+00, 7.48330e+00, 7.92320e+00, 8.36310e+00,
8.80300e+00, 9.24290e+00, 9.68280e+00, 1.01227e+01, 1.05626e+01,
1.10025e+01, 1.14424e+01, 1.18823e+01, 1.23222e+01, 1.27621e+01,
1.32020e+01, 1.36419e+01, 1.40818e+01, 1.45217e+01, 1.49616e+01,
1.54015e+01, 1.58414e+01, 1.62813e+01, 1.67212e+01, 1.71611e+01,
1.76010e+01, 1.80409e+01, 1.84808e+01, 1.89207e+01, 1.93606e+01,
1.98005e+01, 2.02404e+01, 2.06803e+01, 2.11202e+01, 2.15601e+01,
2.20000e+01]),
<BarContainer object of 50 artists>)
df_out["Funding(in Millions)"].describe()
count 386.00000 mean 3.72514 std 4.73236 min 0.00500 25% 0.60000 50% 1.70000 75% 5.00000 max 22.00000 Name: Funding(in Millions), dtype: float64
The funding data is heavily right-skewed
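The skew claim can be quantified with a skewness coefficient; the synthetic lognormal sample below is a hypothetical stand-in for `df_out["Funding(in Millions)"]`, chosen because it shows the same long right tail.

```python
import numpy as np
from scipy.stats import skew

# Hypothetical right-skewed sample standing in for df_out["Funding(in Millions)"]
rng = np.random.default_rng(0)
funding = rng.lognormal(mean=0.5, sigma=1.2, size=386)

# Positive skewness indicates a long right tail
g1 = skew(funding)
print(f"sample skewness: {g1:.2f}")
```

A skewness near 0 would indicate symmetry; values well above 1 confirm a heavy right skew.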
# Filter out companies still operating and companies that closed.
df_ops= df_out.OperatingState[(df_out.OperatingState == "Operating") | (df_out.OperatingState == "Closed") ]
df_ops.value_counts()
Operating 275 Closed 56 Name: OperatingState, dtype: int64
fig, ax = plt.subplots(1, 2)
fig.set_figheight(5)
fig.set_figwidth(15)
sns.distplot(df_out.loc[df_out.OperatingState == 'Operating', 'Funding(in Millions)'], ax = ax[0])
sns.distplot(df_out.loc[df_out.OperatingState =='Closed', 'Funding(in Millions)'], ax = ax[1])
ax[0].set_title('Funding in Millions by the companies still operating')
ax[1].set_title('Funding in Millions by companies that got closed')
plt.show()
/Users/santoshsingh/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
sns.countplot(x =df_ops, palette="Set2")
<AxesSubplot:xlabel='OperatingState', ylabel='count'>
1. Write the null hypothesis and alternative hypothesis.
2. Test for significance and conclusion
ANSWER:
H0: μ1 = μ2, or μ2 - μ1 = 0; that is, there is no difference between the population mean funding of operating and closed companies
HA: μ1 ≠ μ2; the mean funding of the two groups differs (the tests below are two-sided)
Let's consider a significance level of 5%
α = 0.05
from statsmodels.stats.weightstats import ztest
x1=df_out.Funding[df_out.OperatingState == "Operating"]
x2=df_out.Funding[df_out.OperatingState == "Closed"]
N1 = len(x1)
N2 = len(x2)
alpha = 0.05
test_statistic, p_value = ztest(x1, x2)
if p_value <= alpha:
print(f'Since the p-value, {round(p_value, 3)} < {alpha} (alpha) the difference is significant and we reject the Null hypothesis')
else:
print(f"Since the p-value, {round(p_value,3)} > {alpha} (alpha) the difference is not significant and, we fail to reject the Null hypothesis")
Since the p-value, 0.192 > 0.05 (alpha) the difference is not significant and, we fail to reject the Null hypothesis
from scipy.stats import ttest_ind
test_statistic, p_value = ttest_ind(x1, x2)
if p_value <= alpha:
print(f'Since the p-value, {round(p_value, 3)} < {alpha} (alpha) the difference is significant and we reject the Null hypothesis')
else:
print(f"Since the p-value, {round(p_value,3)} > {alpha} (alpha) the difference is not significant and, we fail to reject the Null hypothesis")
Since the p-value, 0.193 > 0.05 (alpha) the difference is not significant and, we fail to reject the Null hypothesis
Hence we fail to reject the null hypothesis that there is no difference between the mean funding of the two groups.
So we can say that the data provides no evidence that companies with more funding succeed more often.
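Because Funding is heavily skewed, the normality assumption behind the z- and t-tests is questionable; a nonparametric Mann-Whitney U test is one way to corroborate the conclusion. The two synthetic samples below are hypothetical stand-ins for `x1` (Operating) and `x2` (Closed) with the notebook's group sizes.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical skewed funding samples with the notebook's group sizes
rng = np.random.default_rng(42)
operating = rng.lognormal(mean=14, sigma=1.5, size=275)  # stand-in for x1
closed = rng.lognormal(mean=14, sigma=1.5, size=56)      # stand-in for x2

alpha = 0.05
# Mann-Whitney compares rank distributions, so it is robust to the skew
stat, p_value = mannwhitneyu(operating, closed, alternative="two-sided")
if p_value <= alpha:
    print(f"p = {p_value:.3f} <= {alpha}: reject H0")
else:
    print(f"p = {p_value:.3f} > {alpha}: fail to reject H0")
```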
data_new= pd.read_csv('DS - Part3 - CompanyX_EU.csv')
df_c = data_new.copy(deep = True)
df_c.head()
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| 0 | 2600Hz | 2600hz.com | NaN | Disrupt SF 2013 | Contestant | Operating |
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed |
| 2 | 3DPrinterOS | 3dprinteros.com | NaN | Disrupt SF 2016 | Contestant | Operating |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating |
| 4 | 42 Technologies | 42technologies.com | NaN | Disrupt NYC 2013 | Contestant | Operating |
df_c.Result.value_counts()
Contestant 488 Finalist 84 Audience choice 41 Winner 26 Runner up 23 Name: Result, dtype: int64
winners = df_c.Result.value_counts()[1:].sum()  # everything except 'Contestant': Finalist, Audience choice, Winner, Runner up
contestants = df_c.Result.value_counts()['Contestant']
contestants_operating = df_c.OperatingState[df_c.Result == 'Contestant'].value_counts().loc['Operating']
winners_operating = df_c.OperatingState[df_c.Result != 'Contestant'].value_counts().loc['Operating']
print(f"percentage of winners that are still operating: {(winners_operating/winners) *100}")
print(f"percentage of contestants that are still operating: {(contestants_operating/contestants)*100}")
percentage of winners that are still operating: 76.4367816091954 percentage of contestants that are still operating: 68.0327868852459
Write the null hypothesis and alternative hypothesis.
Test for significance and conclusion
Null hypothesis (H0): The proportion of companies still operating is the same in both categories, winners and contestants
Alternative hypothesis (Ha): The proportions of operating companies in the two categories are significantly different from each other
from statsmodels.stats.proportion import proportions_ztest
test_statistic, p_value = proportions_ztest([contestants_operating, winners_operating], [contestants, winners])
if p_value <= alpha:
print(f'Since the p-value, {round(p_value, 3)} < {alpha} (alpha) the difference is significant and we reject the Null hypothesis')
else:
print(f"Since the p-value, {round(p_value,3)} > {alpha} (alpha) the difference is not significant and, we fail to reject the Null hypothesis")
Since the p-value, 0.037 < 0.05 (alpha) the difference is significant and we reject the Null hypothesis
Hence we can say that winners of the events tend to remain operational more often, since there is a significant difference between the proportions of operating companies among winners and contestants.
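The two-proportion z-test can also be reproduced by hand as a cross-check on `proportions_ztest`; the counts below (332 of 488 contestants and 133 of 174 non-contestants operating) are recovered from the group sizes and the percentages printed above.

```python
import math
from statistics import NormalDist

# Operating counts and group sizes, recovered from the notebook's output
count = [332, 133]   # contestants operating, "winners" operating
nobs = [488, 174]    # total contestants, total "winners" (non-contestants)

p1, p2 = count[0] / nobs[0], count[1] / nobs[1]
p_pool = sum(count) / sum(nobs)                 # pooled proportion under H0
se = math.sqrt(p_pool * (1 - p_pool) * (1 / nobs[0] + 1 / nobs[1]))
z = (p1 - p2) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p-value
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

This matches the ~0.037 p-value reported by statsmodels, which also pools the variance by default.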
df_out.Event.value_counts()
TC50 2008 25 TC40 2007 22 Disrupt NY 2015 21 Disrupt NYC 2012 19 Disrupt SF 2014 19 Disrupt SF 2013 19 Disrupt SF 2011 19 Disrupt SF 2015 19 Disrupt NYC 2013 19 TC50 2009 19 Disrupt SF 2016 17 Disrupt NY 2016 16 Disrupt NYC 2011 15 Disrupt NYC 2014 15 Disrupt SF 2012 15 Disrupt SF 2010 13 Hardware Battlefield 2016 12 Hardware Battlefield 2014 12 Disrupt London 2015 11 Disrupt EU 2014 10 Disrupt NYC 2010 10 Hardware Battlefield 2015 10 Disrupt London 2016 10 Disrupt EU 2013 9 - 6 Disrupt Beijing 2011 4 Name: Event, dtype: int64
disrupt_events = df_out[df_out.Event.apply(lambda x: 'Disrupt' in x)].Event.value_counts()
disrupt_events
Disrupt NY 2015 21 Disrupt SF 2011 19 Disrupt SF 2014 19 Disrupt SF 2015 19 Disrupt NYC 2012 19 Disrupt SF 2013 19 Disrupt NYC 2013 19 Disrupt SF 2016 17 Disrupt NY 2016 16 Disrupt NYC 2014 15 Disrupt SF 2012 15 Disrupt NYC 2011 15 Disrupt SF 2010 13 Disrupt London 2015 11 Disrupt NYC 2010 10 Disrupt EU 2014 10 Disrupt London 2016 10 Disrupt EU 2013 9 Disrupt Beijing 2011 4 Name: Event, dtype: int64
disrupt_events_2013_grt = df_out[df_out.Event.apply(lambda x: 'Disrupt' in x and int(x[-4:]) > 2012)].Event
disrupt_events_2013_grt
1 Disrupt NYC 2013
3 Disrupt NY 2016
13 Disrupt SF 2015
14 Disrupt London 2016
16 Disrupt SF 2015
...
635 Disrupt NY 2015
641 Disrupt NYC 2013
642 Disrupt SF 2014
646 Disrupt London 2015
660 Disrupt SF 2013
Name: Event, Length: 185, dtype: object
NY_events = df_out.loc[disrupt_events_2013_grt[disrupt_events_2013_grt.apply(lambda x: 'NY' in x)].index, 'Funding(in Millions)']
SF_events = df_out.loc[disrupt_events_2013_grt[disrupt_events_2013_grt.apply(lambda x: 'SF' in x)].index, 'Funding(in Millions)']
EU_events = df_out.loc[disrupt_events_2013_grt[disrupt_events_2013_grt.apply(lambda x: 'EU' in x or 'London' in x)].index, 'Funding(in Millions)']
print(len(NY_events), len(SF_events), len(EU_events))
71 74 40
Null Hypothesis (H0): the average funds raised by companies across the three regions are the same: μNY = μSF = μEU, that is, there is no difference between the population means
Alternative Hypothesis (Ha): at least one region's average funds differ from the others
from scipy.stats import f_oneway
stat, p_value = f_oneway(NY_events, SF_events, EU_events)
if p_value <= alpha:
print(f'Since the p-value, {round(p_value, 3)} < {alpha} (alpha) the difference is significant and we reject the Null hypothesis')
else:
print(f"Since the p-value, {round(p_value,3)} > {alpha} (alpha) the difference is not significant and, we fail to reject the Null hypothesis")
Since the p-value, 0.628 > 0.05 (alpha) the difference is not significant and, we fail to reject the Null hypothesis
Since we fail to reject the null hypothesis, there is no evidence that companies participating in a particular region raise funds that are significantly higher or lower than the others.
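Given the skewed funding distributions, a Kruskal-Wallis H-test (the nonparametric counterpart of one-way ANOVA) is a reasonable robustness check. The three synthetic samples below are hypothetical stand-ins for `NY_events`, `SF_events` and `EU_events` with the same sizes.

```python
import numpy as np
from scipy.stats import kruskal

# Hypothetical skewed funding samples with the notebook's group sizes
rng = np.random.default_rng(7)
ny = rng.lognormal(mean=1.0, sigma=1.0, size=71)
sf = rng.lognormal(mean=1.0, sigma=1.0, size=74)
eu = rng.lognormal(mean=1.0, sigma=1.0, size=40)

# Kruskal-Wallis compares the groups' rank distributions rather than their means
stat, p_value = kruskal(ny, sf, eu)
print(f"H = {stat:.3f}, p = {p_value:.3f}")
```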
plt.figure(figsize=(12,5))
sns.distplot(NY_events, label = 'NY')
sns.distplot(SF_events, label = 'SF')
sns.distplot(EU_events, label = 'EU')
plt.legend()
plt.show()
/Users/santoshsingh/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
fig, ax = plt.subplots(1, 3)
fig.set_figheight(5)
fig.set_figwidth(15)
sns.distplot(NY_events, label = 'NY', ax = ax[0])
sns.distplot(SF_events, label = 'SF', ax = ax[1])
sns.distplot(EU_events, label = 'EU', ax = ax[2])
ax[0].set_title('NY')
ax[1].set_title('SF')
ax[2].set_title('EU')
plt.show()
/Users/santoshsingh/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
The modes of the three distributions are similar.
Dispersion in NY is quite high compared to the others.
Overall, the distributions look quite similar to the eye.
import pandas_profiling
pandas_profiling.ProfileReport(data_new)
1. Funding has 214 (32.3%) missing values
2. The funding data is heavily right-skewed
3. Even after removing many outliers, the scale of Funding remains very wide
4. We do not have absolute numbers to use directly in our tests
5. There are 220 missing values in total (214 in Funding, 6 in Product); filling them would allow more accurate observations
6. The distributions are not normal and contain many outliers
7. Event is highly correlated with OperatingState
8. Funding has many outliers, which suggests the data is far from uniform